Code
library(tidyverse)
library(knitr)
library(av)
library(gganimate)
library(broom)
library(kableExtra)
library(bibtex)library(tidyverse)
library(knitr)
library(av)
library(gganimate)
library(broom)
library(kableExtra)
library(bibtex)gdp_data <- read.csv("gdp_pcap.csv")
life_expectancy_data <- read.csv("sp_dyn_le00_in.csv")gdp_long <- gdp_data |>
select(country, X1960:X2022) |>
pivot_longer(cols = X1960:X2022,
names_to = "Year",
values_to = "GDP per Capita") |>
mutate(
Year = as.integer(str_remove(Year, "^X")),
`GDP per Capita` = case_when(
str_detect(`GDP per Capita`, "k") ~ parse_number(`GDP per Capita`) * 1000,
TRUE ~ parse_number(`GDP per Capita`)))
# Only selecting 1960-2022 to match life expectancy and not include predicted GDP after 2022life_expectancy_long <- life_expectancy_data |>
pivot_longer(cols = X1960:X2022,
names_to = "Year",
values_to = "Life Expectancy") |>
mutate(Year = as.integer(str_remove(Year, "^X")))combined_data <- gdp_long |>
inner_join(life_expectancy_long, join_by(country, Year))# Define country vectors by continent
# Used ChatGPT to format this (https://chatgpt.com/share/67c0f690-5020-8004-b149-ef2bae1b914e)
combined_data <- combined_data |>
mutate(
continent = factor(case_when(
country %in% c("Afghanistan", "Armenia", "Azerbaijan", "Bahrain", "Bangladesh",
"Bhutan", "Brunei", "Cambodia", "China", "Georgia", "Hong Kong, China", "India",
"Indonesia", "Iran", "Iraq", "Israel", "Japan", "Jordan",
"Kazakhstan", "Kuwait", "Kyrgyz Republic", "Lao", "Lebanon",
"Malaysia", "Maldives", "Mongolia", "Myanmar", "Nepal",
"North Korea", "Oman", "Pakistan", "Palestine", "Philippines",
"Qatar", "Saudi Arabia", "Singapore", "South Korea", "Sri Lanka",
"Syria", "Tajikistan", "Thailand", "Timor-Leste", "Turkey",
"Turkmenistan", "UAE", "Uzbekistan", "Vietnam", "Yemen") ~ "Asia",
country %in% c("Angola", "Algeria", "Benin", "Botswana", "Burkina Faso",
"Burundi", "Cape Verde", "Cameroon", "Central African Republic",
"Chad", "Comoros", "Congo, Dem. Rep.", "Congo, Rep.", "Cote d'Ivoire",
"Djibouti", "Egypt", "Equatorial Guinea", "Eritrea", "Eswatini",
"Ethiopia", "Gabon", "Gambia", "Ghana", "Guinea", "Guinea-Bissau",
"Ivory Coast", "Kenya", "Lesotho", "Liberia", "Libya",
"Madagascar", "Malawi", "Mali", "Mauritania", "Mauritius",
"Morocco", "Mozambique", "Namibia", "Niger", "Nigeria",
"Rwanda", "Sao Tome and Principe",
"Senegal", "Seychelles", "Sierra Leone", "Somalia", "South Africa",
"South Sudan", "Sudan", "Tanzania", "Togo", "Tunisia", "Uganda",
"Zambia", "Zimbabwe") ~ "Africa",
country %in% c("Albania", "Andorra", "Austria", "Belarus", "Belgium",
"Bosnia and Herzegovina", "Bulgaria", "Croatia", "Cyprus",
"Czech Republic", "Denmark", "Estonia", "Finland", "France",
"Germany", "Greece", "Hungary", "Iceland", "Ireland", "Italy",
"Latvia", "Liechtenstein", "Lithuania", "Luxembourg", "Malta",
"Moldova", "Monaco", "Montenegro", "Netherlands", "North Macedonia",
"Norway", "Poland", "Portugal", "Romania", "Russia", "San Marino",
"Serbia", "Slovak Republic", "Slovenia", "Spain", "Sweden", "Switzerland",
"Ukraine", "UK", "Vatican City") ~ "Europe",
country %in% c("Antigua and Barbuda", "Bahamas", "Barbados", "Belize",
"Canada", "Costa Rica", "Cuba", "Dominica",
"Dominican Republic", "El Salvador", "Grenada",
"Guatemala", "Haiti", "Honduras", "Jamaica", "Mexico", "Micronesia, Fed. Sts.","Nicaragua", "Panama", "St. Kitts and Nevis",
"St. Lucia", "St. Vincent and the Grenadines",
"Trinidad and Tobago", "USA") ~ "North America",
country %in% c("Argentina", "Bolivia", "Brazil", "Chile", "Colombia",
"Ecuador", "Guyana", "Paraguay", "Peru", "Suriname",
"Uruguay", "Venezuela") ~ "South America",
country %in% c("Australia", "Fiji", "Kiribati", "Marshall Islands",
"Micronesia", "Nauru", "New Zealand", "Palau", "Papua New Guinea",
"Samoa", "Solomon Islands", "Tonga", "Tuvalu", "Vanuatu") ~ "Oceania",
TRUE ~ "Other" # For countries not in any of the continent lists
), levels = c("Asia", "Africa", "Europe", "North America", "South America", "Oceania", "Other"))
) |>
drop_na()This analysis explores the relationship between GDP per capita and life expectancy to uncover the story of how a country’s wealth can shape the well-being and future of its people (Gapminder (2025b)). By evaluating these variables, we can better understand how economic growth can lead to healthier, longer lives. We aim to answer several key questions: Does an increase in GDP per capita lead to higher life expectancy? How does this relationship vary by continent? How has this relationship manifested over time?
Many studies have examined the relationship between GDP and a country’s life expectancy, consistently reaching similar conclusions that build upon one another. Research has largely found that an increase in GDP per capita leads to higher life expectancy. The International Journal of Health Sciences and Research quantifies this relationship through panel data analysis, stating that “each additional $10,000 per capita per year increases life expectancy at birth by an average of 1.8 years” (Shafi and Fatima (2019)). Further research from Georgia Tech suggests that this growth follows a non-linear pattern, with diminishing returns at higher GDP levels, where additional increases no longer significantly impact life expectancy (Shah, Akram, and Alvi (2023)). Given these findings, our hypothesis aligns with Georgia Tech’s research—we expect GDP per capita to have a strong positive effect on life expectancy, but only up to a certain threshold, after which the impact levels off (Georgia Tech (2023)). This result conveys the idea that increased economic output contributes to a healthier lifestyle with greater access to resources, however biological limitations are present as this cannot be a continuous relationship.
The GDP data set contains data on the gross domestic product per person adjusted for differences in purchasing power, in international dollars and fixed at 2017 prices. The “international dollars” currency is adjusted for Purchasing Power Parity (PPP) and is a virtual currency that enables better comparisons that allow us to compare what a dollar would buy in each country (a comparable amount of goods and services) as a U.S. dollar would buy in the United States. Additionally, GDP per capita is the gross domestic product divided by the population of the country, which gives us a rough estimate of the average annual income of the citizens. The GDP data set contains a “country” variable which has the names of each country, and the GDP per capita for each of the countries in the data set from the year 1800 to 2022.
The Life Expectancy data set contains data on life expectancy at birth in total number of years. Life expectancy at birth can be defined as the number of years a newborn infant would live if prevailing patterns of mortality at the time of its birth were to stay the same throughout its life. This data set also contains a “country” variable and the life expectancy at birth value for each country in the data set for every year from 1960 to 2022.
The most important phase of data cleaning was removing unwanted observations. After 2022, the GDP data set reported predicted values for GDP. We want our analysis to rely solely on observed values, so these years were removed. Additionally, the observations prior to 1960 were removed since no measurements for life expectancy were recorded prior to that date. Keeping the years 1960 to 2022 allowed us to have real data for both quantitative variables.
The plots below illustrate the relationship between GDP per capita (on a log scale to meet normality assumptions–further explained in later sections) and life expectancy across different continents. Figure 1 displays points representing each country within a given continent, with a red linear regression line fitted to the overall data. The plot is then faceted by continent, allowing for a comparison of how the relationship between GDP per capita and life expectancy varies across different regions of the world. Figure 2 places these regression lines on a singular plot. It is important to note that GDP is recorded in international dollars fixed at 2017 prices.
ggplot(combined_data, aes(x = log10(`GDP per Capita`), y =`Life Expectancy`)) +
geom_point(aes(color = continent), alpha = 0.7) +
geom_smooth(method = "lm", color = "red", se = FALSE) +
labs( title = "GDP and Life Expectancy Regression Faceted by Continent",
subtitle = "Life Expectancy (years)",
x = "Log of GDP per Capita",
y = " ", color = "Continent",
caption = "Figure 1") +
theme_bw() +
theme(strip.background = element_blank(), axis.text.x = element_text(size = 10)) +
facet_wrap(~continent)
ggplot(combined_data, aes(x = log10(`GDP per Capita`), y = `Life Expectancy`, color = continent)) +
geom_smooth(method = "lm", aes(group = continent), se = FALSE) +
labs(
title = "GDP and Life Expectancy Regression by Continent",
subtitle = "Life Expectancy (years)",
x = "Log GDP per Capita",
y = " ",
color = "Continent",
caption = "Figure 2"
) +
theme_bw() The graph illustrates a moderately strong, positive relationship between GDP per capita (logged) and life expectancy across all continents, indicating that wealthier countries tend to have higher life expectiencies. However, the strength of this relationship varies by region. For instance, South America and Africa exhibit a larger positive trend, while Oceania reflects a flatter slope.
This plot is similar to the one above, but allows for some more direct comparisons between regions. While this also graphs life expectancy and log GDP per capita for each region, this chart does not display individual country values. With all fitted regression lines on the same plot, we can better identify the range of log GDP values, compare slopes, and evaluate similar trends. While we can still see that South America has the highest life expectancy growth rates based on log GDP, we can also discern that Europe and Africa have similar slopes.
This animated plot displays the average GDP per capita and average life expectancy for each region. These values are then plotted over time, capturing the years 1960 to 2022. Since we are evaluating two quantitative variables, average GDP is measured on the y axis, and average life expectancy is measured in point size. Each continental region is then represented by its own color.
combined_data_avg <- combined_data |>
group_by(Year, continent) |>
summarise(
avg_life_expectancy = mean(`Life Expectancy`, na.rm = TRUE),
avg_gdp_per_capita = mean(`GDP per Capita`, na.rm = TRUE)
)
p <- ggplot(combined_data_avg, aes(x = Year, y = avg_gdp_per_capita,
size = avg_life_expectancy, color = continent)) +
geom_point(alpha = 0.7) +
scale_size_continuous(range = c(1, 5)) +
labs(
title = "Trends in Average GDP per Capita Over Time",
subtitle = "Average Life Expectancy (years)",
x = "Year",
y = "Average GDP per Capita",
color = "Continent",
size = "Average Life Expectancy",
caption = "Figure 3"
) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
scale_color_brewer(palette = "Set2") +
transition_time(Year) +
shadow_mark() + # Keeps all past years visible
ease_aes('linear') # Smooth transitions
# Animate the plot
animate(p, nframes = 100, fps = 10)From this, we can see that in general, both average GDP per capita and average life expectancy increase for all continents over time. It is important to note that this graph does not utilize log GDP values, which provides better perspective on these change over time. Europe consistently has the highest average GDP per capita, and has high life expectancy values. It is also clear that Africa makes the most significant life expectancy growth within this time period, however has very small average GDP per capita. Further, we can see that North and South America follow similar patterns throughout the 60 years.
To quanitfiy the relationship between life expectancy and GDP per capita, we used a linear regression model. We regressed life expectancy onto a log transformation of GDP per capita to model causality. The log transformation was necessary for two key reasons. First, it helps make the data more “normal” or symmetric. Since many statistical analyses, including regression, often assume normally distributed residuals, transforming GDP per capita can help meet this assumption by reducing skewness. Second, the log transformation helps satisfy the assumption of constant variance in linear modeling. Without the transformation, the variance in GDP per capita would be highly uneven, with extreme values exerting disproportionate influence on the model. By applying the log transformation, we achieve a more linear and stable relationship, improving the accuracy and reliability of our regression estimates. Figure 4 shows the relationship between the transformed GDP and life expectancy as well as the resulting Ordinary Least Squares (OLS) model. The interpretation of the model coefficients are the life expectancy when GDP per capita is 0 (\(\beta_{0}\)) and the increase in life expectancy for every 1 dollar increase (in international dollars fixed at 2017 prices) in GDP per capita (\(\beta_{1}\)).
gdp_regression <- lm(`Life Expectancy` ~ log(`GDP per Capita`),
data = combined_data)
#The data model we used for the equation
ggplot(combined_data, aes(x = log10(`GDP per Capita`),
y =`Life Expectancy`)) +
geom_point(aes(color = continent),
alpha = 0.7) +
geom_smooth(method = "lm",
color = "red",
se = FALSE) +
labs( title = "Relationship Between GDP and Life Expectancy\n(Linear Regression Model)",
subtitle = "Life Expectancy (years)",
x = "Log of GDP per Capita",
y = " ",
color = "Continent",
caption = "Figure 4") +
theme_bw() +
theme(strip.background = element_blank(),
axis.text.x = element_text(size = 10))tidy(gdp_regression) |>
kable(caption = "Linear Regression Equation Information")| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -3.606519 | 0.4837107 | -7.455943 | 0 |
log(GDP per Capita) |
7.668944 | 0.0543856 | 141.010679 | 0 |
glance(gdp_regression) |>
kable(caption = "Linear Regression Fit Information")| r.squared | adj.r.squared | sigma | statistic | p.value | df | logLik | AIC | BIC | deviance | df.residual | nobs |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.6252438 | 0.6252124 | 6.950341 | 19884.01 | 0 | 1 | -40023.13 | 80052.27 | 80074.42 | 575725.7 | 11918 | 11920 |
For our linear regression model, we choose to use all the data we had in order to obtain a model that is the best representation of the association between life expectancy and GDP per capita. GDP per capita is still logged, as explained above, in order for the association to be linear and fit well with the model.
\[ \hat{y} = -3.607 + 7.669*log(x) \]
When the log of a country’s GDP is 0 (a GDP per capita of $1) the country is estimated to have a life expectancy of –3.607 years. This negative life expectancy is clearly unrealistic, as life expectancy cannot be below zero. This occurs because the linear regression model extrapolates beyond the range of observed data, applying the same trend to extremely low GDP values where the relationship may not hold. In reality, countries with very low GDP per capita still have positive life expectancy values, even if significantly lower than wealthier nations. This limitation highlights the importance of considering the meaningful range of the model’s predictions and the potential need for alternative modeling approaches when dealing with extreme values. Despite this, the model effectively captures the general trend, predicting that for every 1 dollar increase (in international dollars fixed at 2017 prices) in the log of GDP per capita, life expectancy increases by approximately 7.669 years.
To understand how well our OLS model explains the variability in life expectancy, we can examine how variance is distributed across observed values, predicted values, and residuals. The overall variance in life expectancy represents the total range of values across all observations. The variance in predicted values indicates the portion of this variation that is explained by GDP per capita, while the variance in residuals accounts for the unexplained differences. By analyzing these components, we can determine the proportion of variance captured by the model and evaluate its effectiveness in predicting life expectancy.
# Calculate variances
var_response <- var(combined_data$`Life Expectancy`, na.rm = TRUE)
var_fitted <- var(fitted(gdp_regression), na.rm = TRUE)
var_residuals <- var(residuals(gdp_regression), na.rm = TRUE)
# Create a dataframe for display
variance_table <- data.frame(
Statistic = c("Variance in Response", "Variance in Fitted Values", "Variance in Residuals"),
Value = c(var_response, var_fitted, var_residuals))
variance_table |>
kable(caption = "Variance in Regression Model") |>
kable_styling(full_width = FALSE, bootstrap_options = c("striped", "hover"))| Statistic | Value |
|---|---|
| Variance in Response | 128.89231 |
| Variance in Fitted Values | 80.58912 |
| Variance in Residuals | 48.30319 |
The amount of variability in life expectancy accounted for by the regression model can be calculated as the proportion of Variance in Fitted Values over the Variance in Response: 80.59 / 128.89 ≈ 0.63. This value is validated in the Linear Regression Fit Information table, as this number matches the R squared value. This tells us that 62.52% of the variability in life expectancy is explained by the model, indicating a moderate-to-strong relationship between GDP per capita and life expectancy. The remaining 37.48% of the variability (Variance in Residuals) is unexplained by the model, suggesting that there are other extraneous factors that also influence life expectancy.
The following plots extend our analysis by incorporating simulated values for life expectancy, allowing us to better understand the variability and predictive accuracy of our model. Figure 5 illustrates the observed relationship between GDP per capita (on a log scale) and life expectancy, highlighting how this association differs across continents. Figure 6, on the other hand, presents a similar scatter plot but with simulated life expectancy values generated based on our regression model, incorporating random error. This comparison provides insight into how well the model replicates real-world patterns in life expectancy.
predict_le <- predict(gdp_regression)
estimated_sigma <- sigma(gdp_regression)
rand_error <- function(x, mean = 0, sd){
errors <- rnorm(n=length(x), mean = mean, sd = sd)
y <- (x + errors)
return(y)
}
set.seed(1234)
sim_response <- tibble(sim_le = rand_error(predict_le,
sd = estimated_sigma))
map_int(combined_data, ~sum(is.na(.x))) country Year GDP per Capita Life Expectancy continent
0 0 0 0 0
data_with_predict <- combined_data |>
bind_cols(sim_response)#Graph 1 here
ggplot(combined_data, aes(x = log10(`GDP per Capita`),
y =`Life Expectancy`)) +
geom_point(aes(color = continent),
alpha = 0.7) +
geom_smooth(method = "lm",
color = "red",
se = FALSE) +
labs( title = "Relationship Between GDP and Life Expectancy\n(Linear Regression Model)",
x = "Log of GDP per Capita",
subtitle = "Life Expectancy (years)",
y = " ", color = "Continent",
caption = "Figure 5") +
theme_bw() +
theme(strip.background = element_blank(),
axis.text.x = element_text(size = 10))
#Graph 2 here
ggplot(data_with_predict,
aes(x = log10(`GDP per Capita`),
y =sim_le)) +
geom_point(aes(color = continent),
alpha = 0.7) +
geom_smooth(method = "lm",
color = "red",
se = FALSE) +
labs( title = "Relationship Between GDP and Simulated Life Expectancy",
x = "Log of GDP per Capita",
subtitle = "Simulated Life Expectancy (years)",
y = " ",
color = "Continent",
caption = "Figure 6") +
theme_bw() +
theme(strip.background = element_blank(),
axis.text.x = element_text(size = 10))Figure 5 is included here once again (same as Figure 4) to show a side-by-side comparison between the real data and simulated values. It reinforces the positive relationship between GDP per capita and life expectancy, with distinct clustering patterns across continents. Figure 6 replaces the actual life expectancy values with simulated ones, allowing us to assess how well the model captures the true underlying trend. The x-axis remains the log-transformed GDP per capita, while the y-axis now reflects the simulated life expectancy values. The strong positive correlation indicated by the red regression line suggests that the model effectively replicates the observed trend. However, some deviations between the two plots indicate the presence of variability not fully accounted for by the model. The dispersion of points across continents remains similar, suggesting that while the model provides a reasonable approximation, it may not be able to fully predict and reflect real-world complexities.
Finally, in order to evaluate the variability of R-squared values in our regression model, we utilized a simulation. This process involved generating multiple simulated versions of GDP per capita, refitting the regression model for each, and recording the resulting R-squared values. After simulating 1000 R-squared values, these were then plotted into a histogram with an overlaid normal distribution curve.
#returns one r^2 value for a simulated version of the data
get_r2 <- function(regression){
predicted <- predict(regression)
est_sigma <- sigma(regression)
sim_response <- tibble(sim_life_expectancy = rand_error(predicted, sd = est_sigma))
data_with_predict2 <- combined_data |>
bind_cols(sim_response) |>
mutate(`Log of GDP per Capita` = log10(`GDP per Capita`))
sim_regression <- lm(sim_life_expectancy ~ `Log of GDP per Capita`, data = data_with_predict2)
r2_data <- glance(sim_regression)
return(r2_data$r.squared)
}
# Taking 1000 samples
r2s <- map_dbl(.x = 1:1000, .f = ~get_r2(gdp_regression))
# Graphing the sample distribution of R-squared values
tibble(r2s) |>
ggplot(aes(x = r2s)) +
geom_histogram(aes(y = ..density..), fill = "darkgrey", alpha = 0.7) +
stat_function(fun = ~dnorm(.x, mean = mean(r2s), sd = sd(r2s)),
col = "darkorchid",
lwd = 1) +
labs(x = "R-squared Values",
y = " ",
title = "Sample Distribution of R-Squared Values from Simulated Life Expectancy",
subtitle = "Density Distribution of Simulated R-Squared Values",
caption = "Figure 7") +
theme_bw()The histogram in the plot displays the distribution of R-squared values obtained from 1,000 simulations of the regression model, with a normal curve overlaid to illustrate the theoretical distribution. The simulated R-squared values are tightly clustered around 0.625, indicating that the model consistently explains a similar proportion of variance in life expectancy across different simulations, which also aligns with the observed R-squared value from the original model of 0.63. The relatively narrow spread suggests that the model consistently explains around 62-63% of the variance in life expectancy across different samples, implying good reliability. The overlaid density curve approximates a normal distribution, further indicating that the variation in R-squared values follows a predictable pattern.
This analysis aimed to determine the existence of a relationship between GDP per capita and life expectancy. We believed there would be a positive correlation between them meaning that as countries saw their economy grow, they would see a positive change in health outcomes such as life expectancy. Our findings concurred with this and showed a strong relationship between GDP growth an life expectancy. As GDP per capita increases by one percent in a country, its life expectancy increased by 7.669 years.
We tested the strength of our model by using it to predict life expectancies based on simulated GDP data. The simulation yeilded normal residuals when compared to the actual life expectancy data. This suggests the model is an accurate representation of the trends in the data.
Our findings concur with existing research and support the general conclusion that an increase in per person wealth increases lifespan. An important thing tp note is that our model used a log transformation on GDP meaning that smaller economies will derive more benefit from the same increase in GDP per capita than larger economies.
There are limitations in our model given its simplicity. It is possible that there is reverse causation, when people have greater life expectancy they are productive members of society for longer. Additionally, there are likley missing variables, such as investment or geopolitical stability, that could help explain more of the variance in the data. These provide opportunities for further research and looking for more ways to increase life expectancy.